Abstract

We theoretically derive and numerically simulate a new phenomenon called
self-supervision, in which the higher layers of a multilayer unsupervised
network control the optimisation of the lower layers, even when there is no
external supervising teacher present. Self-supervision is a very convenient
hybrid, which combines the best properties of unsupervised and supervised
network training algorithms.
Figure 1.

Communication channel encoding/decoding interpretation of vector quantisation.

Figure 2.

Two-stage vector quantisation. Two channels x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 are coupled through a common channel
(y_1,y_2) → z → (y'_1,y'_2).

Figure 3.

Ensemble average two-stage vector quantisation. P(y'_1,y'_2|y_1,y_2) models the
distortion due to the ensemble average of feasible (y_1,y_2) → z → (y'_1,y'_2).

Figure 4.

The marginal PDFs P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2) of the ensemble average
distortion P(y'_1,y'_2|y_1,y_2) determine the topographic neighbourhood
functions for optimising the x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 channels.

Figure 5.

Flowchart showing the main steps in simulating a 2-stage vector quantiser, with
the second stage implemented as an ensemble average vector quantiser. The
section in the dashed box is an optional minimum distortion encoding scheme,
which refines the encoding found by the nearest neighbour scheme.

Figure 6.

Typical approximations to the distortion PDF P(y'_1|y_1,y_2). We use a very
simple prescription in which P(y'_1|y_1,y_2) is set to one of only three
possible PDFs according to the value of the underlying gradient
G ≡ ∂P(y_1,y_2)/∂y_1. These histograms may be applied directly to the stage 0
VQ's as topographic neighbourhood functions.

Figures 7-12.

Plots of reconstruction distortion for nearest neighbour and minimum distortion
encoding modes, and for independent and correlated channel modes.

Figure 13.

Migration of the PDF P(y_1,y_2) due to the self-supervision effect of the
marginal PDFs P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2), which are the topographic
neighbourhood functions for optimising the x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 channels. Contributions to P(y_1,y_2) which lie inside
the vertical shaded band tend to migrate towards the left, and contributions
inside the horizontal band tend to move upwards; in all cases the migration is
in the direction in which the corresponding marginal PDF is biased. Compare
Figure 4.

Figure 14.

(a) Determining the nearest neighbour code vector position for a single vector
quantiser.

(b) Determining the PDF P(y'|y) of the nearest neighbour code vector position
from the code vector density ρ(y) of an ensemble of vector quantisers.
1. Introduction

A common criticism of unsupervised adaptive networks is their poor performance
as classifier networks (i.e. supervised networks). In [1] the so-called
"learning vector quantisation" (LVQ) method was introduced, in which an external
teacher supervises the output of a vector quantiser (VQ) [...]. This is a
hybrid approach, using both unsupervised and supervised elements in its training
algorithm, and it has met with limited success because the ability of a VQ to
tailor sophisticated class boundaries is rather limited. Basically, a VQ is a
[...] network with very limited capabilities, [...] a VQ to enhance the
performance*.

The novel result that we present in this memorandum is that it is not necessary
to introduce an external teacher in order to introduce supervision into an
unsupervised network. It turns out that multilayer adaptive networks can
supervise themselves even though they are trained overall as unsupervised
networks. For simplicity, we study the theory of multistage VQ networks that we
developed in [2, 3, 4, 5], in which the higher layers supply feedback signals to
assist in the optimisation of the lower layers, although there is no external
teacher present†.

Our network is well suited to the problem of low-level image processing, where
information at each length scale should be processed in the light of contextual
information at longer length scales. There are many ways of implementing
contextual processing, but our multistage approach to this problem distinguishes
itself by scaling well to high-dimensional problems‡.

In [6, 7] we successfully apply our theoretical results to time series and image
compression, respectively, and in [8, 9] we solve the problem of detecting
statistically anomalous features in statistically homogeneous backgrounds.
Self-supervision should improve the performance of the network in these
applications, because it extends the sequential layer-by-layer optimisation that
we used in [6, 7, 8, 9] to a full global optimisation of the multilayer
network§. Furthermore, the 'top-down' information pathways that self-supervision
uses are the same as those required by an LVQ-like supervised network, which
allows us easily to extend our approach to become a full classifier network.

In Section 2 we present a simple diagrammatic review of vector quantisation, and
its extension to 2-stage vector quantisation. In Section 3 we perform some
numerical simulations to demonstrate the self-supervision effects that emerge in
a 2-stage VQ.

In the appendix we review the general subject of VQ's, and we derive the
properties of 2-stage VQ's in the limit of a large codebook size (i.e. the
continuum limit). This derivation is rather technical, but it is the central
theoretical result upon which self-supervision depends.


*TIik  ■itomi  ii  MUaeMi  w  *«  ypitii  wp— > 

^Ihc  *Mfs  CM  be  ciwiWd  w  LVtyKta  IqrWM  fei  cMck  m  tmnti  ■crSic  ic  pic  we  IIcwmci.  *t  M  k  ewy  u 

euik  cT  a*  coarau  iHdict  m  iMiUjr  SM  iUI—  of  SMn  tcrutO  «r  *•  aWpiiw  kcmrft  Uki  Kc  keyoM  Su  IM  kkc  Sia  wc 
orlKiily  ikiafi 

^lackUccMkcKkkniiii  kpcifitiilli  »— kMceWe nOiiccaiikcracikk— cciikl cwm  Hkci«kkiirriWmB  lk*etk»M»« 
praroM  •>  keveke  dMw  Meu  ikio  ai  cxakkckM  ikki  hake  <MC*  kwn  lac  c  TMl  cami. 
2. Vector quantisation model

In the appendix we collect together some background theory on VQ's. In this
section we present a diagrammatic summary of this theory.

2.1. Diagrammatic interpretation of vector quantisation

In a VQ the compression and reconstruction process may be interpreted in terms
of encoding and decoding during transmission of information through a noiseless
communication channel*, as shown in Figure 1.

Figure 1. Communication channel encoding/decoding interpretation of vector
quantisation.

There are two distinct methods of encoding that we should consider:

1. Nearest Neighbour Encoding (NN): Select the codevector that lies closest to
the input vector, in the Euclidean sense.

2. Minimum Distortion Encoding† (MD): Model the noise on the communication
channel, then select the codevector that on average produces the closest
estimate of the input vector, in the Euclidean sense.

We can use NN encoding as an approximation to MD encoding when the
signal-to-noise ratio on the channel is large‡; this approximation is exact in
the limit of vanishing channel noise.
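The difference between the two encoding methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codebook is a list of tuples, and the channel noise is assumed to be given as a row-stochastic table `channel[y][y_out]` holding P(y'|y). The function names `nn_encode` and `md_encode` are hypothetical and do not come from the memorandum.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nn_encode(x, codebook):
    """Nearest neighbour encoding: index of the code vector closest to x."""
    return min(range(len(codebook)), key=lambda y: sq_dist(codebook[y], x))


def md_encode(x, codebook, channel):
    """Minimum distortion encoding.

    Picks the index y that minimises the expected squared error
    sum_{y'} P(y'|y) |x'(y') - x|^2, where channel[y][y_out] is the
    (assumed) probability that index y is received as y_out.
    """
    def expected_error(y):
        return sum(p * sq_dist(codebook[y_out], x)
                   for y_out, p in enumerate(channel[y]))
    return min(range(len(codebook)), key=expected_error)


# Toy example: index 1 is often corrupted into the distant index 2.
codebook = [(0.0,), (1.0,), (5.0,)]
noisy = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
noiseless = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = (0.6,)
```

With the noiseless table the two encoders agree, as the text states for vanishing channel noise; with the noisy table the MD encoder avoids the index whose corrupted version reconstructs poorly.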

We may generalise the simple encoder/decoder system to a set of nested VQ's. For
instance, in Figure 2 we show a two-stage VQ.

In Figure 2 we transform the components of the input pair x=(x_1,x_2) to
corresponding components of the pair y=(y_1,y_2), which we then input to the
nested VQ, whose output components y'=(y'_1,y'_2) we use to reconstruct the
corresponding components of x'=(x'_1,x'_2). We present an approximate method of
training the channels x_1 → y_1 ··· y'_1 → x'_1 and x_2 → y_2 ··· y'_2 → x'_2
independently, by modelling them as a pair of noisy channel VQ's (as in
Equation 10). Although in this scheme we


*Our results could indeed be applied to the optimisation of VQs for
communication of information through noisy communication channels, but that is
not the purpose of our research programme. Our primary motivation for using a VQ
model is to obtain simple closed-form analytic solutions, which we may then use
to develop our understanding of more complicated models in the future.

†In Luttrell [12] we derive the asymptotic density of code vectors of
topographic mappings trained using the minimum distortion prescription of
Equation 3, and we find that it is independent of the neighbourhood function,
assuming scalar quantisation and mild monotonicity constraints on the
neighbourhood function. In Luttrell [13] we present an informal derivation of
this result for the vector quantisation case. When we let the width of the
neighbourhood function decrease to zero we recover a standard VQ, which causes
some of the asymptotic properties of VQs to be the same as those of topographic
mappings, provided that we use the MD encoding rather than the NN encoding.

‡Caveat: In the literature it is conventional always to use NN encoding, even
though MD encoding is the correct procedure.
assume (incorrectly) that the channels do not mutually interfere, we
nevertheless obtain useful results*.


[Figure 2: block diagram. Box labels: y_1(x_1), x'_1(y'_1), reconstruction 1;
y_2(x_2), x'_2(y'_2), reconstruction 2.]

Figure 2. Two-stage vector quantisation. Two channels x_1 → y_1 ··· y'_1 → x'_1
and x_2 → y_2 ··· y'_2 → x'_2 are coupled through a common channel
(y_1,y_2) → z → (y'_1,y'_2).

We may further generalise the system in Figure 2 by creating a multi-stage
structure of nested VQ's, which we may use for time series and image compression
(see [6, 7]).

2.2. Ensemble average vector quantisation

We now consider the effect of mutual channel coupling on the optimisation of a
nested VQ.

[Figure 3: block diagram. Box labels: y_1(x_1), x'_1(y'_1), reconstruction 1;
input 2, y_2(x_2), x'_2(y'_2), reconstruction 2.]

Figure 3. Ensemble average two-stage vector quantisation. P(y'_1,y'_2|y_1,y_2)
models the distortion due to the ensemble average of feasible
(y_1,y_2) → z → (y'_1,y'_2).


*For instance, the very successful "anomaly detector" that we reported in [8, 9]
relies entirely on the independent channel assumption. In fact, it would have
been impossible to obtain the very rapid training times (of order of 2 seconds
per layer on a VAXstation 2i00) without the independence assumption.


In Figure 3 we show an ensemble average version of Figure 2, where we take the
average over realisations of the nested VQ (y_1,y_2) → z → (y'_1,y'_2).

The ensemble average approach is appropriate in situations where we continuously
update the nested transformation (y_1,y_2) → z → (y'_1,y'_2) as part of a
training schedule.

Ideally, we should optimise the x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 transformations to adapt to the changes in the
(y_1,y_2) → z → (y'_1,y'_2) transformation, but this is time consuming. Rather,
it is better for us to arrange these transformations to adapt to the properties
of the ensemble of (y_1,y_2) → z → (y'_1,y'_2) transformations that might occur.

Using the marginal PDFs P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2) of
P(y'_1,y'_2|y_1,y_2), we may write the expression for the distortion in Figure 3
as

\[
D = \int dx_1\,dx_2\, P(x_1,x_2) \int dy'_1\, P(y'_1 \mid y_1(x_1), y_2(x_2))\, \| x'_1(y'_1) - x_1 \|^2 + (1 \leftrightarrow 2) \tag{1}
\]

We may interpret the various contributions to the expression for D by working
from the outside of the expression to the inside as follows:

1. The ∫dx_1 dx_2 P(x_1,x_2) (...) integration averages over all the pairs of
inputs (x_1,x_2) to channels 1 and 2, and P(x_1,x_2) specifies the probability
density with which each pair occurs.

2. The ∫dy'_1 P(y'_1|y_1,y_2) (...) integration averages over all the possible
distortions of channel 1, due to the confluence of channel 1 and channel 2 in
the nested VQ.

3. ||x'_1(y'_1) - x_1||^2 is the squared Euclidean distance between the input
vector x_1 and its reconstruction x'_1(y'_1) from the distorted version of
channel 1.

4. (1 ↔ 2) denotes an analogous term for channel 2.

In Figure 4 we represent diagrammatically in (y_1,y_2)-space (and
(y'_1,y'_2)-space) the various terms of Equation 1.

We represent the contours of a typical P(y_1,y_2), a typical
P(y'_1,y'_2|y_1,y_2), and the profiles of its two marginals P(y'_1|y_1,y_2) and
P(y'_2|y_1,y_2). These marginals have a shape that depends on (x_1,x_2), which
therefore mutually couples the contributions to the distortion in Equation 1
arising from the two transformations x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2. It is both pleasing and economical that the ensemble
average nested VQ in Figure 3 automatically determines the topographic
neighbourhood functions for its x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 transformations, thus eliminating the need to introduce
them by hand*.

In the appendix we show how to minimise D by using a minimum distortion
prescription in which we simultaneously optimise y_1(x_1) and y_2(x_2). We
sometimes approximate this by using a nearest neighbour prescription.


*Witb  hiadniht,  Mi  ia  •  povufiil  annw  lor  Mart  •  VQ  anaai  i*  Sm  lira  yiaot  HaT  aa  mmrnfmt  m  mt  t  arM 
mylmlKaiaS  lype  o(  awrlal.  aa  aww  taoty  woald  Sort  aiitaoS  Mr  alo(aa  laX.  Naa  at  art  U«ri  lo  *»  pairMlirj  at  lafoaraa*  r 
inttipraiaiioiia  ol  mon  oanpliaa*M  aiodoli.  optiorc  boloia  *a  vtao  ifaoraai  ol  Ww  aoaaiSrhrr 
Figure 4. The marginal PDFs P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2) of the ensemble
average distortion P(y'_1,y'_2|y_1,y_2) determine the topographic neighbourhood
functions for optimising the x_1 → y_1 ··· y'_1 → x'_1 and
x_2 → y_2 ··· y'_2 → x'_2 channels.

3. Numerical experiments

In this section we present a simple numerical simulation which demonstrates some
of the benefits of self-supervision.

3.1. Basic network operation

We run all of our numerical simulations using the network structure in Figure 3,
with a 4-dimensional input vector x=(x_1,x_2)=(x_11,x_12,x_21,x_22), and with
scalar outputs from the encoders y_1(x_1) and y_2(x_2). This is the minimal
network that functions as a self-supervised VQ. In realistic applications we
would expect much more complicated networks to be used, but they would all
operate according to the principles demonstrated by the network in Figure 3.

In Table 1 we tabulate the various modes of operation that we use in the
numerical simulation that we outline in Figure 5.
                        Nearest Neighbour    Minimum Distortion
                        Encoding             Encoding

Independent Channels    NN/I                 MD/I

Correlated Channels     NN/C                 MD/C

Table 1. Encoding and channel modes used in our numerical simulations.
NN=nearest neighbour encoding, MD=minimum distortion encoding, I=independent
channels, C=correlated channels.

In encoding mode NN we ignore the fact that P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2)
affect the resulting (y_1,y_2) in Equation 18, whereas in encoding mode MD we
take full account of their influence. Note that the part of Figure 5 that is
enclosed in a dashed box is the inner loop that handles the minimum distortion
aspect of encoding mode MD. However, when we use encoding mode NN, we must still
invoke step 3 (i.e. "compute distortion PDFs") of the simulation, because the
distortion PDFs are required by step 6.

The two channel modes I and C test the effect of switching self-supervision off
and on, respectively.

We now describe in greater detail each of the numbered boxes in Figure 5.

1. Clamp Layer 0 Inputs

Generate x=(x_1,x_2)=(x_11,x_12,x_21,x_22) using an appropriate random vector
generating routine. We choose (x_11,x_12) as a uniformly distributed random
vector in a disc-shaped region, and then generate (x_21,x_22) by rotating
(x_11,x_12) about the disc's centre by a random angle uniformly sampled from the
interval [-θ,+θ].

The details of how we generate each x are as follows. We use circular random
variables in order to ensure that there is no preferential orientation. We
generate (x_21,x_22) from (x_11,x_12) in the way described in order to ensure
that the marginal PDFs P(x_11,x_12) and P(x_21,x_22) are the same. We randomly
rotate within [-θ,+θ] in order to ensure that (x_11,x_12) and (x_21,x_22) are
not completely correlated yet not completely independent, in a way that is
controlled by the size of θ. The limit θ=0 gives
P(x)=P(x_11,x_12)δ(x_21-x_11)δ(x_22-x_12) (i.e. identically correlated), and the
limit θ=π gives P(x)=P(x_11,x_12)P(x_21,x_22) (i.e. completely independent). The
overall effect of this prescription for generating inputs x is to create a
training set with fixed marginal PDFs and programmable correlations, as we
require in order to demonstrate self-supervision in a carefully controlled way.
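The input prescription of step 1 can be sketched as follows. The disc radius (taken as 1), its centre (taken as the origin) and the function name `sample_input` are assumptions made for illustration; the memorandum does not specify them.

```python
import math
import random


def sample_input(theta, rng, radius=1.0):
    """Draw one training vector (x11, x12, x21, x22).

    (x11, x12) is uniform in a disc centred on the origin (assumed), and
    (x21, x22) is the same point rotated about the centre by an angle
    drawn uniformly from [-theta, +theta].
    """
    # sqrt of a uniform variate gives a uniform density over the disc area.
    r = radius * math.sqrt(rng.random())
    phi = rng.uniform(0.0, 2.0 * math.pi)
    x11, x12 = r * math.cos(phi), r * math.sin(phi)
    a = rng.uniform(-theta, theta)
    x21 = x11 * math.cos(a) - x12 * math.sin(a)
    x22 = x11 * math.sin(a) + x12 * math.cos(a)
    return (x11, x12, x21, x22)


rng = random.Random(0)
correlated = sample_input(0.0, rng)      # theta = 0: channel 2 copies channel 1
decorrelated = sample_input(math.pi, rng)  # theta = pi: orientation is scrambled
```

Because channel 2 is a pure rotation of channel 1, both points always share the same radius; only their relative orientation is randomised by θ.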

2. Compute Nearest Neighbours

We specify the stage 0 codebooks by x'_1(y_1)=(x'_11(y_1),x'_12(y_1)) and
x'_2(y_2)=(x'_21(y_2),x'_22(y_2)), so the nearest neighbour encoding
prescription yields

\[
\begin{aligned}
y_1^0(x_{11},x_{12}) &= \arg\min_{y_1} \left( (x'_{11}(y_1)-x_{11})^2 + (x'_{12}(y_1)-x_{12})^2 \right) \\
y_2^0(x_{21},x_{22}) &= \arg\min_{y_2} \left( (x'_{21}(y_2)-x_{21})^2 + (x'_{22}(y_2)-x_{22})^2 \right)
\end{aligned}
\tag{2}
\]

This is a standard procedure which should need no further explanation. For
brevity, we denote the result of this operation as (y_1^0,y_2^0). Encoding mode
NN uses (y_1^0,y_2^0), whereas encoding mode MD refines (y_1^0,y_2^0) somewhat
in the ensuing steps.

3. Compute Distortion PDFs

Channel mode C: The joint distortion PDF is P(y'|y)=P(y'_1,y'_2|y_1,y_2), whose
two marginal PDFs P(y'_1|y_1,y_2) and P(y'_2|y_1,y_2) specify the topographic
neighbourhoods of the stage 0 codebooks.

Channel mode I: We also use the two reduced marginal PDFs P(y'_1|y_1) and
P(y'_2|y_2) to perform a control simulation in which the pair of VQ's are
trained independently. The mode I simulation acts as a control to check that the
self-supervision effects that we observe in the mode C simulation genuinely
arise from the transfer of information between the pair of VQ's.

We derive P(y'_1|y_1,y_2) (and P(y'_2|y_1,y_2)) from P(y_1,y_2) using a
heuristic procedure which we may obtain from the following simplification of
Equation 35

\[
P(y'_1 \mid y_1,y_2) \propto \rho(y'_1,y_2)\, \exp\!\left( -\kappa\, \rho(y_1,y_2)\, (y'_1-y_1)^2 \right) \tag{3}
\]

where we retain only those terms that depend on y'_1. We may interpret the terms
in Equation 3 as follows. The exponential factor determines the envelope of
values of y'_1 that are permitted by the PDF, and the ρ(y'_1,y_2) factor
provides a bias that weights the PDF in the direction of increasing code vector
density. If we recall that ρ ∝ P^{N/(N+2)} = P^{1/2} (for our N=2 dimensional
VQ's), then we may replace the ρ factors in Equation 3 by P^{1/2} factors. In
our simulations we shall go one step further by approximating Equation 3 as

\[
P(y'_1 \mid y_1,y_2) =
\begin{cases}
\pi(y'_1-y_1), & G > \kappa \\
\left( \pi(y'_1-y_1) + \pi(y_1-y'_1) \right)/2, & |G| \le \kappa \\
\pi(y_1-y'_1), & G < -\kappa
\end{cases}
\tag{4}
\]

where G ≡ ∂P(y_1,y_2)/∂y_1. We display a typical set of distortion PDFs as
histograms plotted against Δy=y'_1-y_1 in Figure 6.
Figure 5. Flowchart showing the main steps in simulating a 2-stage vector
quantiser, with the second stage implemented as an ensemble average vector
quantiser. The section in the dashed box is an optional minimum distortion
encoding scheme, which refines the encoding found by the nearest neighbour
scheme.

[Figure 6: three histogram panels, labelled G < -κ, |G| ≤ κ and G > κ.]

Figure 6. Typical approximations to the distortion PDF P(y'_1|y_1,y_2). We use a
very simple prescription in which P(y'_1|y_1,y_2) is set to one of only three
possible PDFs according to the value of the underlying gradient
G ≡ ∂P(y_1,y_2)/∂y_1. These histograms may be applied directly to the stage 0
VQ's as topographic neighbourhood functions.

The whole of Equation 4 (and Figure 6) is specified by the values of just 3
numbers. Each ∂P(y_1,y_2)/∂y_1 is specified by the values of 3 numbers. We
choose to define them as follows

\[
\pi(y'_1-y_1) =
\begin{cases}
\pi_{-}, & \Delta y = -1 \\
\pi_{0}, & \Delta y = 0 \\
\pi_{+}, & \Delta y = +1
\end{cases}
\tag{5}
\]


Experiment    π_-     π_0     π_+

1             0.35    0.60    0.05
2             0.30    0.60    0.10
3             0.25    0.60    0.15
4             0.20    0.60    0.20
5             0.15    0.60    0.25
6             0.10    0.60    0.30
7             0.05    0.60    0.35

Table 2. Values of π_± and π_0 that we use in 7 separate numerical experiments.
Experiment 4 uses an unbiassed distortion, experiments 5-7 use a distortion that
is biassed in the direction of increasing PDF (see Figure 6), and experiments
1-3 are biassed in the opposite direction. Only experiments 5-7 have a
distortion that corresponds to the one required by theory.
In Table 2 we list the values of π_± and π_0 that we use in 7 separate numerical
simulations. We use a variety of values in order to investigate the effect of
both positive and negative biasses, i.e. biasses both in the direction of and
opposite to the gradient of P(y_1,y_2). Note that only a positive bias
corresponds to the requirements of Equation 3, where ρ(y'_1,y_2) biasses
P(y'_1|y_1,y_2) in the direction of positive ∂P(y_1,y_2)/∂y_1.
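The three-way choice of distortion PDF in Equation 4, parameterised by the three numbers of Equation 5, can be sketched as below. The function name `distortion_pdf` and the dictionary representation keyed by Δy are illustrative assumptions, not part of the memorandum.

```python
def distortion_pdf(G, kappa, pi_minus, pi_zero, pi_plus):
    """Return {dy: probability} for dy = y1' - y1 in {-1, 0, +1}.

    G is the gradient dP(y1,y2)/dy1 at the current encoding and kappa is
    the threshold of Equation 4.
    """
    forward = {-1: pi_minus, 0: pi_zero, +1: pi_plus}    # pi(y1' - y1)
    backward = {-1: pi_plus, 0: pi_zero, +1: pi_minus}   # pi(y1 - y1')
    if G > kappa:
        return forward
    if G < -kappa:
        return backward
    # |G| <= kappa: symmetrised (unbiassed) histogram.
    return {dy: 0.5 * (forward[dy] + backward[dy]) for dy in forward}


# Row 7 of Table 2: strongest positive bias.
row7 = (0.05, 0.60, 0.35)
uphill = distortion_pdf(G=1.0, kappa=0.5, pi_minus=row7[0],
                        pi_zero=row7[1], pi_plus=row7[2])
flat = distortion_pdf(G=0.0, kappa=0.5, pi_minus=row7[0],
                      pi_zero=row7[1], pi_plus=row7[2])
```

For a positively biassed row the weight shifts towards Δy=+1 when G is large and positive, and towards Δy=-1 when G is large and negative, so the bias always follows the direction of increasing P(y_1,y_2).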

We may remove the effect of self-supervision by replacing Equation 3 by

\[
P(y'_1 \mid y_1) \propto \rho(y'_1)\, \exp\!\left( -\kappa\, \rho(y_1)\, (y'_1-y_1)^2 \right) \tag{6}
\]

where ρ(y_1) depends on the marginal PDF P(y_1). Naturally, we can generate the
marginal PDFs during a numerical simulation in which we use joint PDFs.

4. Compute Expected Reconstruction Error

From Equation 1 we may write the expected reconstruction error D(x) for the
current input vector x as

\[
D(x) = \int dy'_1\, P(y'_1 \mid y_1,y_2) \left( (x'_{11}(y'_1)-x_{11})^2 + (x'_{12}(y'_1)-x_{12})^2 \right) + (1 \leftrightarrow 2) \tag{7}
\]

where we initialise (y_1,y_2) to (y_1^0,y_2^0) the first time we pass through
the minimum distortion loop. We evaluate the integral over y'_1 (and y'_2)
somewhat crudely as a sum using the appropriate histograms chosen from Figure 6.
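The histogram sum that replaces the integral in Equation 7 can be sketched as follows. The treatment of codebook ends (distorted indices are clamped to the valid range) is an assumption made here for illustration, as are the names `channel_error` and `expected_error`; the memorandum does not say how the boundaries are handled.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def channel_error(x, y, pdf, codebook):
    """One channel's term of Equation 7 as a sum over the histogram.

    pdf maps dy = y' - y to its probability. Indices pushed past either
    end of the codebook are clamped (an assumption of this sketch).
    """
    last = len(codebook) - 1
    return sum(p * sq_dist(codebook[min(max(y + dy, 0), last)], x)
               for dy, p in pdf.items())


def expected_error(x1, x2, y1, y2, pdf1, pdf2, codebook1, codebook2):
    """D(x): channel 1 term plus the analogous (1 <-> 2) term."""
    return (channel_error(x1, y1, pdf1, codebook1)
            + channel_error(x2, y2, pdf2, codebook2))


codebook = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
symmetric = {-1: 0.2, 0: 0.6, +1: 0.2}
d = expected_error((1.0, 0.0), (1.0, 0.0), 1, 1,
                   symmetric, symmetric, codebook, codebook)
```

In this toy case each channel contributes 0.2·1 + 0.6·0 + 0.2·1 = 0.4, so D(x) is 0.8.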

5. Adjust Encoding to Reduce Error

Now that we have calculated the expected reconstruction error D(x) for our
initial guess (y_1^0,y_2^0) at the correct values of y_1(x_11,x_12) and
y_2(x_21,x_22), we must investigate how it varies in the vicinity of
(y_1^0,y_2^0). We may then locate the local minimum (y_1,y_2) of D(x), which in
general will not be (y_1,y_2)=(y_1^0,y_2^0) (i.e. minimum distortion encoding is
not the same as nearest neighbour encoding). Note that for each alternative
value of (y_1,y_2) that we investigate we must repeat steps 3 and 4 in order to
determine the corresponding value of D(x). In our simulations we explore only
the immediate neighbourhood of (y_1^0,y_2^0) given by

\[
y_1 \in \{ y_1^0,\ y_1^0 \pm 1 \}, \qquad y_2 \in \{ y_2^0,\ y_2^0 \pm 1 \}
\]

This rather limited search for the minimum distortion encoding succeeds only
because we choose to use the distortion PDFs P(y'_1|y_1,y_2) and
P(y'_2|y_1,y_2) shown in Figure 6. If the range of these distortion PDFs were
greater, then we would have to consider using a longer range search procedure.
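The restricted search of step 5 amounts to evaluating D(x) on a 3×3 grid of candidate encodings around the nearest neighbour result. A minimal sketch, assuming the error is supplied as a callable `error_at(y1, y2)` (which would internally repeat steps 3 and 4) and that both codebooks have `n_codes` entries; these names are hypothetical.

```python
def md_refine(y1_0, y2_0, error_at, n_codes):
    """Search y1 in {y1_0, y1_0 +/- 1}, y2 in {y2_0, y2_0 +/- 1}.

    Returns the candidate pair with the smallest error and that error.
    Candidates outside [0, n_codes) are skipped.
    """
    best = (y1_0, y2_0)
    best_err = error_at(y1_0, y2_0)
    for d1 in (-1, 0, 1):
        for d2 in (-1, 0, 1):
            y1, y2 = y1_0 + d1, y2_0 + d2
            if not (0 <= y1 < n_codes and 0 <= y2 < n_codes):
                continue
            err = error_at(y1, y2)
            if err < best_err:
                best, best_err = (y1, y2), err
    return best, best_err


# Toy error surface with its minimum one step away from the starting guess.
toy_error = lambda y1, y2: (y1 - 3) ** 2 + y2 ** 2
refined, refined_err = md_refine(2, 0, toy_error, n_codes=8)
```

The starting guess is always retained as a candidate, so the refined encoding can never have a larger expected error than the nearest neighbour encoding under the same error function.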

6. Update Code Vectors

From Equation 19b we may write the code vector update prescription as

\[
\Delta x'_i(y'_i) = \epsilon\, P(y'_i \mid y_1,y_2)\, \left( x_i - x'_i(y'_i) \right) \tag{8}
\]

where (y_1,y_2) is the minimum distortion encoding located in steps 3-5 above.
In our numerical simulations we use ε=0.1 throughout the optimisation; we do not
gradually reduce ε to zero.
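Equation 8 moves every code vector in the topographic neighbourhood of the chosen encoding towards the input, weighted by the distortion PDF. A minimal sketch for one channel, assuming the PDF is a `{dy: probability}` histogram as in Figure 6 and that neighbours falling off the end of the codebook are simply skipped (an assumption of this sketch):

```python
def update_codebook(codebook, x, y, pdf, eps=0.1):
    """Apply Equation 8 in place for one channel and one training vector.

    codebook: list of tuples x'(k); x: input tuple; y: chosen encoding;
    pdf: {dy: P(y'=y+dy | y1, y2)}; eps: update step size.
    """
    last = len(codebook) - 1
    for dy, p in pdf.items():
        k = y + dy
        if 0 <= k <= last:
            codebook[k] = tuple(c + eps * p * (xi - c)
                                for c, xi in zip(codebook[k], x))
    return codebook


book = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
update_codebook(book, x=(1.0, 1.0), y=1,
                pdf={-1: 0.2, 0: 0.6, +1: 0.2}, eps=0.1)
```

Each affected code vector moves a fraction ε·P of the way towards x, so the winner (Δy=0) moves most and its neighbours move less.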

7. Clamp Layer 1 Inputs

Now that we have finally decided what (y_1,y_2) should be output by the encoders
(y_1(x_1),y_2(x_2)) we may use this to clamp the inputs to layer 2.

8. Update Histogram

Layer 2 contains a (leaky) histogram representation of P(y_1,y_2) which we now
update according to the prescription in Equation 42, with the decay term
implemented as r(k) → r(k)/e after every 1/(1-β) time steps. In our simulations
we use a memory time of 1/(1-β)=100.
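A leaky histogram of the kind described in step 8 might look like the sketch below. Equation 42 is not reproduced in this section, so the details here are assumptions: each visit adds one count to its bin, and every `memory_time` updates all counts are divided by e. The class name `LeakyHistogram` is hypothetical.

```python
import math


class LeakyHistogram:
    """Two-dimensional histogram of (y1, y2) with periodic exponential decay."""

    def __init__(self, n_bins, memory_time=100):
        self.counts = [[0.0] * n_bins for _ in range(n_bins)]
        self.memory_time = memory_time
        self.steps = 0

    def update(self, y1, y2):
        """Record one visit, then decay all bins every memory_time steps."""
        self.counts[y1][y2] += 1.0
        self.steps += 1
        if self.steps % self.memory_time == 0:
            self.counts = [[c / math.e for c in row] for row in self.counts]

    def pdf(self):
        """Normalised estimate of P(y1, y2); uniform before any update."""
        total = sum(sum(row) for row in self.counts)
        n = len(self.counts)
        if total == 0.0:
            return [[1.0 / (n * n)] * n for _ in range(n)]
        return [[c / total for c in row] for row in self.counts]


hist = LeakyHistogram(n_bins=2, memory_time=4)
for _ in range(4):
    hist.update(0, 0)   # the fourth update triggers one decay
hist.update(1, 1)
```

Old visits are progressively down-weighted, so the histogram tracks the current encoder outputs rather than the statistics of earlier stages of training.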

3.2. Splitting procedure

In [4] we presented in detail a phenomenological distortion model that we used
to obtain an efficient training procedure for topographic mappings and their
application to multistage VQ's. Alternatively, we could use the standard
topographic mapping training procedure in [12], but this is a rather inefficient
algorithm. It is much more efficient to use a splitting procedure where we
perform a crude optimisation using 2 codevectors, which we then use to
initialise a more refined optimisation using 4 codevectors, and so on. In our
simulations we stop at 8 codevectors. This "coarse to fine" strategy is very
effective at rapidly producing an optimum set of codevectors.

In our numerical simulations we optimise each generation of code vectors using
50 training vectors per code vector, before splitting to produce the initial
code vector configuration in the next generation.
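The coarse-to-fine schedule (2, then 4, then 8 code vectors, with 50 training vectors per code vector in each generation) can be sketched as follows. The memorandum does not state how a code vector is split, so the rule used here, replacing each code vector by two slightly perturbed copies, is an assumption; the inner training loop is also simplified to plain nearest neighbour updates rather than the full scheme of Figure 5. The names `split_codebook` and `train_by_splitting` are hypothetical.

```python
import random


def split_codebook(codebook, rng, jitter=1e-3):
    """Replace every code vector by two copies displaced by +/- a small offset."""
    out = []
    for c in codebook:
        d = [rng.uniform(-jitter, jitter) for _ in c]
        out.append(tuple(ci - di for ci, di in zip(c, d)))
        out.append(tuple(ci + di for ci, di in zip(c, d)))
    return out


def train_by_splitting(sample, rng, final_size=8, per_code=50, eps=0.1):
    """Train 2 code vectors, split to 4, train, split to 8, train."""
    codebook = [sample(rng), sample(rng)]
    while True:
        for _ in range(per_code * len(codebook)):
            x = sample(rng)
            # Simplified update: move only the nearest code vector.
            k = min(range(len(codebook)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(codebook[j], x)))
            codebook[k] = tuple(c + eps * (xi - c)
                                for c, xi in zip(codebook[k], x))
        if len(codebook) >= final_size:
            return codebook
        codebook = split_codebook(codebook, rng)


unit_square = lambda rng: (rng.random(), rng.random())
final = train_by_splitting(unit_square, random.Random(0))
```

Keeping the two children of each code vector adjacent in the list preserves the ordering of the parent generation, which matters when the codebook index carries topographic meaning.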

3.3. Experimental results

We now present the results of several numerical simulations conducted according
to the procedure that we have described. We run each simulation 4 times to cover
the possibilities NN/I, NN/C, MD/I and MD/C that we show in Table 1.

In Figures 7, 8 and 9 we present the NN/I and NN/C results, and in Figures 10,
11 and 12 we present the MD/I and MD/C results. In each Figure we present two
plots for channel modes I and C. The dashed lines indicate error bar envelopes,
where each point that we plot is the average of the value of D obtained from 16
independent optimisation simulations (in each simulation we accumulate
statistics for 256 test set samples to estimate D).

In the case of symmetric distortion (i.e. entry number 4 in Table 2) the I and C
plots produce the same value of D. This is because this type of distortion
forces P(y'_1|y_1,y_2)=P(y'_1|y_1). This is a simple check of the consistency of
our I and C results.

For positively biassed distortions (i.e. in accord with theory) the C plots are
systematically lower than the I plots. This behaviour demonstrates convincingly
that self-supervision produces a reduced reconstruction error, whether NN or MD
encoding is used.
For negatively biassed distortions (i.e. in contradiction with theory) the C
plots are systematically higher than the I plots. Because we use an artificially
incorrect distortion it is difficult to interpret this result closely. It merely
corroborates what we might have expected to happen when we ignore what the
theory tells us to do.

When we study the effect of encoding mode, we discover that the MD plots are
systematically lower than the corresponding NN plots. This behaviour
demonstrates convincingly that a full search for the appropriate encoding is
better than a partial search, whether I or C channels are used. This is to be
expected.

As the correlation between the channels is reduced (i.e. as the input
correlation angle is increased), the difference between the I and C plots
systematically decreases, except for Figure 12 where we present the MD/I and
MD/C plots for uncorrelated inputs. We would expect that the I and C plots
should overlap in Figure 12 and in Figure 9, because there are no correlations
between the channels. However Figure 12, and to a lesser extent Figure 9, show a
clear departure from this expectation. This apparent failure occurs because the
histograms suffer from Poisson statistics, so they do not record independent
channel statistics (i.e. what is recorded in the histograms does not satisfy
P(y_1,y_2)=P(y_1)P(y_2)), and the C simulation is affected by these spurious
correlations to produce results that differ from the I simulation.
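The finite-sample effect described above is easy to reproduce. The sketch below draws genuinely independent pairs (y_1,y_2), histograms them, and measures how far the recorded joint histogram is from the product of its own marginals. The bin count of 8 matches the final codebook size; the sample sizes and the function name `max_factorisation_gap` are illustrative assumptions.

```python
import random


def max_factorisation_gap(n_samples, n_bins, rng):
    """Largest |P(y1,y2) - P(y1)P(y2)| in a histogram of independent draws."""
    joint = [[0] * n_bins for _ in range(n_bins)]
    for _ in range(n_samples):
        joint[rng.randrange(n_bins)][rng.randrange(n_bins)] += 1
    p1 = [sum(row) / n_samples for row in joint]
    p2 = [sum(joint[i][j] for i in range(n_bins)) / n_samples
          for j in range(n_bins)]
    return max(abs(joint[i][j] / n_samples - p1[i] * p2[j])
               for i in range(n_bins) for j in range(n_bins))


small_sample_gap = max_factorisation_gap(100, 8, random.Random(1))
large_sample_gap = max_factorisation_gap(20000, 8, random.Random(1))
```

With roughly 100 effective samples spread over 64 bins the recorded histogram does not factorise, even though the underlying variables are independent; the gap shrinks only as the sample count grows. A histogram with a short memory time therefore always carries some spurious correlation of this kind.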

Taken together, Figures 7-12 demonstrate the consistency of our numerical
simulations, and demonstrate the benefits of self-supervision (and,
coincidentally, minimum distortion encoding) when a simple network is applied to
an artificially constructed set of data. These results corroborate the
theoretical results that we presented earlier.

4. Conclusions

The main result that we present in this memorandum is the theoretical derivation
of and numerical simulation of the phenomenon of self-supervision. For
illustrative purposes we consider the problem of a pair of communication
channels that cause mutual distortion. In our numerical simulations we present a
simple demonstration of the improvement in performance that we can obtain by
jointly optimising the pair of communication channels, compared with independent
optimisation.

In order to make contact with the theory of unsupervised adaptive networks we
model the communication channel problem as a nested VQ, as shown in Figure 2.
The effect of the inner VQ models the mutual channel distortion, and thus
influences the way in which the outer VQ's must be optimised. This is the
phenomenon of self-supervision, where one part of an overall unsupervised
network supervises the optimisation of another part of the network.

This principle may easily be generalised to a multilayer unsupervised network,
although we have not done so in this memorandum. This would mean that we operate
an unsupervised multilayer network in such a way that it supervises its own
internal operation by passing control signals back from higher layers to lower
layers, which in turn causes the lower layers to process their inputs more
effectively.
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5.  Recommendation 

The results that we present in this memorandum suggest an interesting new direction of
research into multistage LVQ networks. The LVQ approach [1] to training a classifier
network has had limited success because it uses a single stage VQ (with supervision).
Our multistage VQ network could be supervised in the same way as in the LVQ method,
and the training signals backpropagated down through the layers of the network. At each
layer the backpropagating signals would consist of two components: a term which
requires the code vectors to be updated (i.e. a programmable topographic neighbourhood,
as in self-supervision), plus a term which requires the layer's inputs to be updated. The
first type of term is familiar from our experience with self-supervision, whereas the
second term is new (it did not occur in our simulations because we did not experiment
with multilayer networks).

6.  Notation  and  terminology 

Single stage vector quantiser:

$x$ = input data

$y$ = compressed data

$x'$ = reconstruction of the input data

$y(x)$ = compression operation, mapping $x \to y$

$x'(y)$ = reconstruction operation, mapping $y \to x'$

$P(x)$ = PDF of input data

$P(y)$ = PDF of channel data

Two-stage vector quantiser:

$x_1$ = input data (channel 1)

$x_2$ = input data (channel 2)

$y_1$ = compressed data (channel 1)

$y_2$ = compressed data (channel 2)

$y'_1$ = distorted compressed data (channel 1)

$y'_2$ = distorted compressed data (channel 2)

$x'_1$ = reconstruction of the input data (channel 1)

$x'_2$ = reconstruction of the input data (channel 2)

$y_1(x_1)$ = compression operation (channel 1), mapping $x_1 \to y_1$

$y_2(x_2)$ = compression operation (channel 2), mapping $x_2 \to y_2$

$x'_1(y'_1)$ = reconstruction operation (channel 1), mapping $y'_1 \to x'_1$

$x'_2(y'_2)$ = reconstruction operation (channel 2), mapping $y'_2 \to x'_2$

$z(y_1,y_2)$ = compression operation (fusing channel 1 and channel 2)




Stephen P Luttrell, 6 December 1991


$y'_1(z)$ = reconstruction operation (recovering channel 1)

$y'_2(z)$ = reconstruction operation (recovering channel 2)

$P(x_1,x_2)$ = joint PDF of input data (channel 1 and channel 2)

$P(x_1)$ = marginal PDF of input data (channel 1)

$P(x_2)$ = marginal PDF of input data (channel 2)

$P(y_1,y_2)$ = joint PDF of compressed data (channel 1 and channel 2)

$P(y'_1,y'_2|y_1,y_2)$ = conditional PDF of distorted compressed data (channel 1 and
channel 2)

$P(y'_1|y_1,y_2)$ = marginal conditional PDF of distorted compressed data (channel 1)

$P(y'_2|y_1,y_2)$ = marginal conditional PDF of distorted compressed data (channel 2)

Note that we use the terms compress/encode (and reconstruct/decode) interchangeably.

We also use the generic notation $P(\cdot)$ to denote a PDF, so unless we state otherwise the
functional form of $P(\cdot)$ may be deduced from the nature of the argument that we insert
into the function.

Finally, we use the word "stage" to denote a pair of adjacent layers in a multilayer
network. Thus "stage 0" means "layers 0 and 1". We use this terminology to refer to the
transformation between layers, rather than the layers themselves.

7.  Appendix 

In this section we present a résumé of the VQ theory of Linde et al [10] (the LBG
algorithm), and its extension to multistage VQ's [2, 3, 4, 5]. These extensions are related
to the VQ theory of Kumazawa et al [11] for communication over a noisy channel, and
to the topographic mapping theory of Kohonen [12] for training self-organising neural
networks¹⁰.

7.1.  Vector  quantisation 

Define $x$ as the input data, $y$ as the compressed data, and $x'$ as the reconstruction of the
input data. Define $y(x)$ as the compression operation $x \to y$, and $x'(y)$ as the
reconstruction operation $y \to x'$, which yields overall $x' = x'(y(x))$. Note that the
compression and reconstruction process may be interpreted in terms of encoding and
decoding during transmission of information through a noiseless communication
channel¹¹, as shown in Figure 1.

We may combine these quantities to obtain the average L2 (i.e. Euclidean) distortion $D_1$,


¹⁰ VQ theory and topographic mapping theory are not exactly equivalent [...]

¹¹ Our results could indeed be applied to the optimisation of VQs for communication of
information through noisy communication channels, but this is not the purpose of our
research programme. Our primary motivation for using a VQ model is to obtain simple
closed-form analytic solutions [...]
$$D_1 = \int dx\, P(x)\, \|x'(y(x)) - x\|^2 \tag{9}$$

The VQ whose (continuum limit) codebook is defined by the pair of functions
$(y(x), x'(y))$ can be optimised by minimising $D_1$ with respect to variations of $y(x)$ and
$x'(y)$.

7.1.1.  Vector  quantisation  for  a  noisy  channel 

A more general form of Equation 9 that gives the average L2 distortion for a VQ with a
noisy communication channel [3, 4, 5, 11] is

$$D_2 = \int dx\, P(x) \int dy'\, \pi(y' - y(x))\, \|x'(y') - x\|^2 \tag{10}$$

In Equation 10 we assume that $y' = y(x) + n$, where $n$ is a random noise variable with PDF
$\pi(n)$, so $P(x,n) = P(x)\pi(n)$, assuming $x$ and $n$ are independent.


7.1.2.  Nearest  neighbour  versus  minimum  distortion  encoding 

We functionally differentiate $D_2$ to calculate the zeros of $\delta D_2/\delta y(x)$ and
$\delta D_2/\delta x'(y)$, which yields¹² (see [3, 4, 5] for the details)

$$y(x) = \arg\min_{y} \int dy'\, \pi(y' - y)\, \|x'(y') - x\|^2 \tag{11}$$

$$x'(y) = \frac{\int dx\, P(x)\, \pi(y - y(x))\, x}{\int dx\, P(x)\, \pi(y - y(x))} \tag{12a}$$

$$\Delta x'(y) = \epsilon\, \pi(y - y(x))\, (x - x'(y)) \tag{12b}$$

In Equation 12 there are two methods of updating $x'(y)$.

1. Batch update (Equation 12a): This is equivalent to one cycle of the LBG algorithm
[10].

2. Continuous update (Equation 12b): This is identical to the topographic mapping
training algorithm [12], so $\pi(n)$ can be interpreted as a topographic neighbourhood
function¹³.

In Equation 11 there are two distinct cases to consider.


¹² The fact that the output of the encoder is actually a discrete variable does not invalidate
our use of an assumed continuous output in our derivations. We use continuum derivations
because it is easier to see what is going on. Afterwards, we convert our continuum
derivations into discrete derivations by exchanging integrals for sums, derivatives for
finite differences, etc. Note that it is not in general possible to convert in the opposite
direction (i.e. from a discrete calculation into a "smooth" continuum calculation), but this
does not affect our results.

¹³ This appears to be a very fertile way of theoretically handling topographic mapping
phenomena.
1. Nearest Neighbour Encoding (NN): When $\pi(n) = \delta(n)$ (i.e. the noiseless case)
Equation 11 specifies a nearest neighbour encoding prescription $y^0(x)$, i.e. given $x$
select as $y^0(x)$ the $y$ that minimises $\|x'(y) - x\|^2$.

2. Minimum Distortion Encoding (MD): When $\pi(n) \neq \delta(n)$ Equation 11 specifies a
minimum distortion encoding prescription $y(x)$, where the effect of $\pi(n)$ is
anticipated when selecting $y(x)$.
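
The difference between the two prescriptions can be illustrated with a minimal Python sketch. It assumes a one-dimensional codebook indexed by integers, a discrete noise PDF `noise` that maps an index offset n to its probability, and clipping of the distorted index at the ends of the codebook. The function names and these choices are illustrative assumptions, not part of the simulations reported in this memorandum.

```python
def nn_encode(x, codebook):
    """Nearest neighbour encoding: the index y that minimises (x'(y) - x)^2."""
    return min(range(len(codebook)), key=lambda y: (codebook[y] - x) ** 2)


def md_encode(x, codebook, noise):
    """Minimum distortion encoding (cf. Equation 11) for a discrete channel.

    `noise` maps an integer offset n to its probability, so the received
    index is y' = y + n, clipped to the codebook range (an assumption).
    """
    last = len(codebook) - 1

    def expected_distortion(y):
        return sum(
            p * (codebook[min(max(y + n, 0), last)] - x) ** 2
            for n, p in noise.items()
        )

    return min(range(len(codebook)), key=expected_distortion)


def continuous_update(x, y, codebook, noise, eps):
    """One step of the continuous update (cf. Equation 12b).

    Each code vector x'(y') moves towards x by eps * pi(y' - y) * (x - x'(y')).
    """
    return [
        c + eps * noise.get(y_prime - y, 0.0) * (x - c)
        for y_prime, c in enumerate(codebook)
    ]


codebook = [0.0, 1.0, 2.0, 3.0, 4.0]
biased_noise = {0: 0.2, 1: 0.8}  # the channel usually shifts the index up by one
x = 2.1
y_nn = nn_encode(x, codebook)                # ignores the channel noise
y_md = md_encode(x, codebook, biased_noise)  # anticipates the upward bias
new_codebook = continuous_update(x, y_md, codebook, biased_noise, eps=0.1)
```

With this biased noise the MD encoder selects the index below the nearest neighbour, so that the distorted index tends to land on the code vector closest to the input.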

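NN encoding can be used as an approximation to MD encoding when $\pi(n) \approx \delta(n)$. In
order to compare NN with MD encoding we develop a Taylor series expansion of the
$\|x'(y') - x\|^2$ factor in Equation 11 about its stationary point $y' = y^0(x)$

$$\|x'(y') - x\|^2 = d^0 + \frac{1}{2!}\sum_{i,j} (y' - y^0)_i (y' - y^0)_j\, d^2_{ij} + \frac{1}{3!}\sum_{i,j,k} (y' - y^0)_i (y' - y^0)_j (y' - y^0)_k\, d^3_{ijk} + \cdots \tag{13}$$

where $d^0$, $d^2_{ij}$ and $d^3_{ijk}$ are the zeroth, second and third derivatives of
$\|x'(y) - x\|^2$ at $y = y^0(x)$

$$d^0 \equiv \|x'(y^0(x)) - x\|^2, \qquad
d^2_{ij} \equiv \left.\frac{\partial^2 \|x'(y) - x\|^2}{\partial y_i\, \partial y_j}\right|_{y = y^0(x)}, \qquad
d^3_{ijk} \equiv \left.\frac{\partial^3 \|x'(y) - x\|^2}{\partial y_i\, \partial y_j\, \partial y_k}\right|_{y = y^0(x)} \tag{14}$$

whence

$$D_2(x) \equiv \int dy'\, \pi(y' - y)\, \|x'(y') - x\|^2 = d^0
+ \frac{1}{2!}\sum_{i,j} \left[ (y - y^0)_i (y - y^0)_j + 2\pi^1_i (y - y^0)_j + \pi^2_{ij} \right] d^2_{ij}
+ \frac{1}{3!}\sum_{i,j,k} \left[ (y - y^0)_i (y - y^0)_j (y - y^0)_k + 3\pi^1_i (y - y^0)_j (y - y^0)_k + 3\pi^2_{ij} (y - y^0)_k + \pi^3_{ijk} \right] d^3_{ijk} + \cdots \tag{15}$$

where we have defined the first three moments of $\pi(n)$ as

$$\pi^1_i \equiv \int dn\, \pi(n)\, n_i, \qquad
\pi^2_{ij} \equiv \int dn\, \pi(n)\, n_i n_j, \qquad
\pi^3_{ijk} \equiv \int dn\, \pi(n)\, n_i n_j n_k \tag{16}$$

To locate the minimum distortion encoding $y(x)$ we must minimise the expression given
by Equation 15 with respect to $y$.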
If we ignore the skewness matrix then we can reduce the problem to minimising
$\sum_{i,j} (y - y^0 + \pi^1)_i (y - y^0 + \pi^1)_j\, d^2_{ij}$. The Hessian matrix $d^2$ has no
negative eigenvalues, because in Equation 13 we expanded about a local minimum of
$\|x'(y) - x\|^2$, so we obtain

$$y(x) = y^0(x) - \pi^1 \tag{17}$$

The solution is shifted away from the nearest neighbour encoding $y^0(x)$ by an amount
equal to minus the bias that $\pi(n)$ introduces. This is intuitively reasonable because
minimum distortion encoding anticipates the distorting effect of $\pi(n)$, and will
compensate for any bias that $\pi(n)$ introduces.

When $\pi^1 = 0$ it is important to retain $d^3$ because it is then the lowest order contribution to
the difference between $y(x)$ and $y^0(x)$.
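
As a minimal check of Equation 17, consider a hypothetical one-dimensional case (our illustrative assumption, not one of the simulated cases) in which the reconstruction is the identity, $x'(y) = y$. Then $y^0(x) = x$, $d^2 = 2$, $d^3 = 0$, and the noise-averaged distortion can be evaluated exactly:

```latex
\begin{aligned}
D_2(x) &= \int dn\, \pi(n)\, (y + n - x)^2 \\
       &= (y - x)^2 + 2\pi^1 (y - x) + \pi^2 \\
       &= (y - x + \pi^1)^2 + \pi^2 - (\pi^1)^2 ,
\end{aligned}
\qquad\Longrightarrow\qquad
y(x) = x - \pi^1 = y^0(x) - \pi^1 .
```

In this special case the shift by minus the bias is exact, because the cubic and higher terms of the expansion vanish.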

7.1.3.  Mean  field  versus  local  field  optimisation 

The update prescription depends on the biased marginals $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$,
which causes a migration of $P(y_1,y_2)$ as shown in Figure 13.

The widths of the marginals $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$ determine the widths of the
vertical and horizontal bands of $P(y_1,y_2)$ that are affected. It is these changes in $P(y_1,y_2)$
(and hence $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$) that cause the differences between the "mean
field" and "local field" optimisation procedures.

Strictly speaking, the change to $P(y_1,y_2)$ is not restricted entirely to the vicinity of the
two regions indicated in Figure 13. For instance, the movement of the code vectors in
the topographic neighbourhood of $y_1(x_1)$ and $y_2(x_2)$ can change the shape of the
quantisation cells of other code vectors, which, in turn, causes other changes to $P(y_1,y_2)$.
However, this is a second order effect.

We see from Figure 13 that the net migration averaged over all inputs has the effect of
squeezing the $P(y_1,y_2)$ distribution. This inward pressure is counterbalanced by the
stretching tendency of each marginal $P(y_1)$ and $P(y_2)$ to become approximately uniform,
as normally occurs in VQ's¹⁴.

When we perform a "mean field" simulation we do not take account of these changes to
$P(y_1,y_2)$ (and hence $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$) when we calculate the gradient of $D$ in
Equation 1. However, in our simulations we represent $P(y_1,y_2)$ as a slowly drifting
histogram, so the changes to $P(y_1,y_2)$ gradually become felt later on in the simulation.
This does not mean that we effectively take $P(y_1,y_2)$ variations into account in a "mean
field" simulation, because when the "mean field" simulation reaches equilibrium so that
the drift of $P(y_1,y_2)$ vanishes, the "local field" gradient of $P(y_1,y_2)$ does not vanish.


¹⁴ We can interpret this competition as an attempt to maximise the mutual information
$I(y_1;y_2)$ between $y_1$ and $y_2$. Thus $I(y_1;y_2) = H(y_1) + H(y_2) - H(y_1,y_2) \ge 0$, where
$H(\cdot)$ is the entropy of its argument, and stretching causes $H(y_1)$ and $H(y_2)$ to increase,
whereas squeezing causes $H(y_1,y_2)$ to decrease. Hence $I(y_1;y_2)$ increases, although this is
not absolutely guaranteed. Mutual information can be used as our basic optimisation
criterion instead of distortion minimisation. An example of this approach and its
relationship to the optimisation of a novel class of hierarchical Gibbs distributions can be
found in Luttrell [15, 16].
Figure 13. Migration of the PDF $P(y_1,y_2)$ due to the self-supervision
effect of the marginal PDFs $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$, which are the
topographic neighbourhood functions for optimising the
$x_1 \to y_1 \rightsquigarrow y_1' \to x_1'$ and $x_2 \to y_2 \rightsquigarrow y_2' \to x_2'$
channels. Contributions to $P(y_1,y_2)$ which lie inside the vertical
shaded band tend to migrate towards the left, and contributions
inside the horizontal band tend to move upwards. In all cases the
migration is in the direction in which the corresponding marginal
PDF is biased. Compare Figure 4.

7.1.4. Two-stage vector quantising

When we minimise $D$ using the mean field procedure we obtain

$$(y_1(x_1), y_2(x_2)) = \arg\min_{(y_1, y_2)} \left[ \int dy_1'\, P(y_1'|y_1,y_2)\, \|x_1'(y_1') - x_1\|^2 + \int dy_2'\, P(y_2'|y_1,y_2)\, \|x_2'(y_2') - x_2\|^2 \right] \tag{18}$$

$$x_k'(y_k') = \frac{\int dx_1\, dx_2\, P(x_1,x_2)\, P(y_k'|y_1(x_1),y_2(x_2))\, x_k}{\int dx_1\, dx_2\, P(x_1,x_2)\, P(y_k'|y_1(x_1),y_2(x_2))} \tag{19a}$$

$$\Delta x_k'(y_k') = \epsilon\, P(y_k'|y_1(x_1),y_2(x_2))\, (x_k - x_k'(y_k')) \tag{19b}$$

which should be compared with Equation 11 and Equation 12. Note that Equation 18
specifies a minimum distortion prescription in which we simultaneously optimise $y_1(x_1)$
and $y_2(x_2)$, using $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$ instead of $\pi(y' - y)$. We sometimes
approximate this by using a nearest neighbour prescription. Note that in [13] we
reported (in the case of scalar quantisation) that this approximation is forbidden if we
wish the density of code vectors to be insensitive to the choice of $P(y_1'|y_1,y_2)$ and
$P(y_2'|y_1,y_2)$.


7.2.  Analytically  solvable  quantisation  model 

In this section we present an analytically solvable model of the ensemble distortion
$P(y_1',y_2'|y_1,y_2)$ shown in Figure 4.

7.2.1. Code vector density

If the number of code vectors $(y_1'(z), y_2'(z))$ in the $(y_1,y_2) \to z \to (y_1',y_2')$ codebook is
very large, then we may calculate $P(y_1',y_2'|y_1,y_2)$ directly. Thus, we model the ensemble
properties of the codebook by defining $\rho(y_1,y_2)$, which specifies the density of code
vectors $(y_1'(z), y_2'(z))$ in $(y_1,y_2)$-space.

7.2.2.  Transition  probability:  integral  equation 


Note that we use the notation $\rho(y)$ (and $P(y'|y)$) and $\rho(y_1,y_2)$ (and $P(y_1',y_2'|y_1,y_2)$)
interchangeably.

In Figure 14 we compare the nearest neighbour encoding prescription for a single VQ
with that for an ensemble of VQ's¹⁵.

In Figure 14a we show an input vector (represented by a cross) and the known positions
of the code vectors of a single VQ. The nearest neighbour can be located by expanding a
circle centred on the input vector until it grazes the nearest code vector, as shown. In
Figure 14b we show the ensemble version of the same diagram, in which the precise
code vector positions are unknown, so there is a distribution $P(y'|y)$ of possible nearest
neighbour locations. There is an analogous interpretation for the minimum distortion
encoding prescription.

Using Figure 14, we may write down an integral equation that relates $P(y'|y)$ to $\rho(y)$.
Thus

$$P(y'|y)\, \delta y' = \left[ 1 - \int_{\|\xi - y\| \le \|y' - y\|} d\xi\, P(\xi|y) \right] \rho(y')\, \delta y' \tag{20}$$


¹⁵ For completeness, we discuss here the intermediate case where the positions of the code
vectors are partially known. The most important way of acquiring partial knowledge is to
note the positions of the nearest neighbour code vectors during training. However, such
knowledge must be continuously updated because migration of the code vector positions
gradually erases any memory of their earlier positions. Partial knowledge lies between the
extremes of Figure 14a and Figure 14b, and its analysis is very complicated. We choose to
analyse the extreme case in Figure 14b because it underestimates, rather than
overestimates, the knowledge about the code vectors that is available.
where the first term on the right hand side is the probability that there is no nearest
neighbour code vector in the sphere of radius $\|y' - y\|$ centred on $y$, and the second term is
the probability of finding a code vector in the volume $\delta y'$ at $y'$. The product of these two
terms gives the probability of finding the nearest neighbour code vector in the volume
$\delta y'$ at $y'$.


Figure 14. (a) Determining the nearest neighbour code vector
position for a single vector quantiser. (b) Determining the PDF
$P(y'|y)$ of the nearest neighbour code vector position from the
code vector density $\rho(y)$ of an ensemble of vector quantisers.


7.2.3.  Transition  probability:  constant  code  vector  density  case 


We now solve Equation 20 for the case $\rho(y) = \rho_0 = \text{constant}$. The nearest neighbour code
vector is then equally likely to lie in any direction from $y$, so $P(y'|y)$ must be a function
only of the radial distance $\|y' - y\|$, which gives

$$P(\|y' - y\|) = \left[ 1 - \int_{\|\xi - y\| \le \|y' - y\|} d\xi\, P(\|\xi - y\|) \right] \rho_0 \tag{21}$$

where $\|y' - y\|^2 = (y' - y)^T (y' - y)$. The integrand is spherically symmetric so we may use the
transformation

$$\int d\xi\, P(\|\xi - y\|) = \Omega_N \int d\|\xi\|\, \|\xi\|^{N-1}\, P(\|\xi\|) \tag{22}$$

where $\Omega_N$ is a constant deriving from the angular integration in $N$ dimensions.
Differentiate Equation 21 with respect to the upper limit $\|y' - y\|$ of the $\|\xi\|$ integration to
yield

$$\frac{\partial P(\|y' - y\|)}{\partial \|y' - y\|} = -\Omega_N\, \rho_0\, \|y' - y\|^{N-1}\, P(\|y' - y\|) \tag{23}$$

and integrate to yield finally

$$P(\|y' - y\|) = P_0 \exp\left( -\frac{\Omega_N\, \rho_0\, \|y' - y\|^N}{N} \right) \tag{24}$$

where $P_0$ should be adjusted to ensure that $P(\|y' - y\|)$ is normalised correctly. The $N = 2$
case reduces to a Gaussian distribution with variance $\sigma^2 = 1/(2\pi\rho_0)$.
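
The $N = 2$ result can be checked numerically with a short Monte Carlo sketch. For $N = 2$, Equation 24 implies that the nearest neighbour distance $r$ has radial density $2\pi\rho_0 r \exp(-\pi\rho_0 r^2)$, whose mean is $1/(2\sqrt{\rho_0})$. The sketch below assumes code vectors scattered uniformly at random in a unit square and measures the distance from the centre of the square to the nearest one; the function names, the unit square and the sample sizes are illustrative assumptions.

```python
import math
import random


def nearest_neighbour_distance(n_points, rng):
    """Distance from the centre of a unit square to the nearest of n_points
    points drawn uniformly at random (so the density is n_points per unit area)."""
    best = float("inf")
    for _ in range(n_points):
        dx = rng.random() - 0.5
        dy = rng.random() - 0.5
        best = min(best, math.hypot(dx, dy))
    return best


def mean_nearest_neighbour_distance(n_points, trials, seed=0):
    """Average the nearest neighbour distance over an ensemble of random codebooks."""
    rng = random.Random(seed)
    total = sum(nearest_neighbour_distance(n_points, rng) for _ in range(trials))
    return total / trials


def predicted_mean_distance(density):
    """Mean of r under the radial density 2*pi*rho0*r*exp(-pi*rho0*r^2)."""
    return 0.5 / math.sqrt(density)


simulated = mean_nearest_neighbour_distance(n_points=200, trials=2000)
predicted = predicted_mean_distance(200)
```

The two values should agree to within the sampling error of the ensemble, provided the density is high enough for edge effects of the square to be negligible.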
7.2.4. Transition probability: variable code vector density case


We now extend the previous results to the case

$$\rho(y') = \rho(y) + (y' - y)^T \nabla\rho(y) \tag{25}$$

which is a first order Taylor expansion of $\rho(y')$ about the point $y' = y$. We anticipate that
the first order expansion of $P(y'|y)$ has the form of Equation 24 with an extra factor to
account for the angular dependence in Equation 25

$$P(y'|y) = P_0(y) \left( 1 + (y' - y)^T a(y) \right) \exp\left( -\frac{\Omega_N\, \rho(y)\, \|y' - y\|^N}{N} \right) \tag{26}$$

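where $a(y)$ has to be determined. Differentiating Equation 20 with respect to $y'$ leads to

$$\frac{\partial P(y'|y)}{\partial y'} = -\rho(y')\, \frac{\partial}{\partial y'} \int_{\|\xi - y\| \le \|y' - y\|} d\xi\, P(\xi|y) + \frac{P(y'|y)}{\rho(y')}\, \frac{\partial \rho(y')}{\partial y'} \tag{27}$$

where we used Equation 20 to replace a term by a $P(y'|y)/\rho(y')$ factor. We may now
insert the expressions for $\rho(y')$ (Equation 25) and $P(y'|y)$ (Equation 26) into Equation 27,
and make use of the results

$$\frac{\partial P(y'|y)}{\partial y'} = P_0(y) \left[ a(y) - \Omega_N\, \rho(y)\, \|y' - y\|^{N-2} (y' - y) \left( 1 + (y' - y)^T a(y) \right) \right] \exp\left( -\frac{\Omega_N\, \rho(y)\, \|y' - y\|^N}{N} \right) \tag{28}$$

which we obtain by using $\partial \|y' - y\|^N / \partial y' = N \|y' - y\|^{N-2} (y' - y)$,

$$\frac{\partial}{\partial y'} \int_{\|\xi - y\| \le \|y' - y\|} d\xi\, P(\xi|y) = P_0(y)\, \Omega_N\, \|y' - y\|^{N-2} (y' - y) \exp\left( -\frac{\Omega_N\, \rho(y)\, \|y' - y\|^N}{N} \right) \tag{29}$$

which we obtain by using Equation 22 to perform the angular integration over the
surface $\|\xi - y\| = \|y' - y\|$, and noting that the term containing $(y' - y)^T a(y)$ vanishes after
angular integration, and

$$\frac{\partial \rho(y')}{\partial y'} = \nabla\rho(y') \tag{30}$$

to obtain

$$a(y) - \Omega_N\, \rho(y)\, \|y' - y\|^{N-2} (y' - y) \left( 1 + (y' - y)^T a(y) \right) = -\Omega_N\, \rho(y')\, \|y' - y\|^{N-2} (y' - y) + \left( 1 + (y' - y)^T a(y) \right) \frac{\nabla\rho(y')}{\rho(y')} \tag{31}$$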
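We then use $\rho(y') = \rho(y) \left( 1 + (y' - y)^T \nabla\rho(y)/\rho(y) \right)$ and
$\nabla\rho(y')/\rho(y') \approx \nabla\rho(y)/\rho(y)$ to simplify Equation 31 into the form

$$a(y) - \frac{\nabla\rho(y)}{\rho(y)} = \Omega_N\, \rho(y)\, \|y' - y\|^{N-2} (y' - y)(y' - y)^T \left[ a(y) - \frac{\nabla\rho(y)}{\rho(y)} \right] \tag{32}$$

where we have dropped the next-to-leading order terms. We may solve Equation 32 by
choosing $a(y) = \nabla\rho(y)/\rho(y)$, to obtain the generalisation of Equation 24 as

$$P(y'|y) = P_0(y) \left( 1 + (y' - y)^T \frac{\nabla\rho(y)}{\rho(y)} \right) \exp\left( -\frac{\Omega_N\, \rho(y)\, \|y' - y\|^N}{N} \right) \tag{33}$$

Strictly speaking, Equation 33 does not specify a valid probability distribution because it
yields a negative probability when $(y' - y)^T \nabla\rho(y)/\rho(y) < -1$. However, this result is the
leading order term in a Taylor expansion about $y' = y$ (see Equation 25), therefore we
implicitly assume $|(y' - y)^T \nabla\rho(y)/\rho(y)| \ll 1$. The effect of the
$1 + (y' - y)^T \nabla\rho(y)/\rho(y)$ term is to relocate the maximum of $P(y'|y)$ from its original
position at $y' = y$ (see Equation 24) to a new position given by

$$y' = y + \left( \frac{\|\nabla\rho(y)\|}{\Omega_N\, \rho(y)^2} \right)^{1/(N-1)} \frac{\nabla\rho(y)}{\|\nabla\rho(y)\|} \tag{34}$$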


The  direction  of  shift  is  consistent  with  the  bias  in  P(y'ly)  that  we  show  in  Figure  4  and 
Figure  13. 


Finally, we marginalise the joint distribution $P(y_1',y_2'|y_1,y_2)$ in Equation 33 in order to
calculate $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$ (which we need in Equation 18 and Equation 19).
For the 2-dimensional case ($N = 2$, $\Omega_2 = 2\pi$, $y \to (y_1,y_2)$) this is easy because the
exponential factors are Gaussians, leading to the result

$$P(y_1'|y_1,y_2) = P_0(y_1,y_2) \left( 1 + \frac{y_1' - y_1}{\rho(y_1,y_2)}\, \frac{\partial \rho(y_1,y_2)}{\partial y_1} \right) \exp\left( -\pi\, \rho(y_1,y_2)\, (y_1' - y_1)^2 \right) \tag{35}$$

with an analogous result for $P(y_2'|y_1,y_2)$. These results may be used to model the
marginals in Figure 4.
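
As a rough illustration of how Equation 35 might be turned into a discrete topographic neighbourhood function, the following Python sketch evaluates it on a list of offsets $y_1' - y_1$. Clipping the bias factor at zero (to avoid the negative probabilities discussed above) and normalising the weights to sum to one are our assumptions, as is the function name.

```python
import math


def marginal_neighbourhood(offsets, rho, drho_dy1):
    """Evaluate the shape of Equation 35 at the given offsets d = y1' - y1.

    rho      : local code vector density rho(y1, y2)
    drho_dy1 : partial derivative of rho with respect to y1
    The bias factor is clipped at zero and the weights are normalised to sum
    to one over the supplied offsets (both are assumptions of this sketch).
    """
    weights = []
    for d in offsets:
        bias = max(0.0, 1.0 + d * drho_dy1 / rho)
        weights.append(bias * math.exp(-math.pi * rho * d * d))
    total = sum(weights)
    return [w / total for w in weights]


offsets = [-0.2, -0.1, 0.0, 0.1, 0.2]
biased = marginal_neighbourhood(offsets, rho=10.0, drho_dy1=20.0)
symmetric = marginal_neighbourhood(offsets, rho=10.0, drho_dy1=0.0)
```

A positive density gradient skews the neighbourhood towards positive offsets, whereas a zero gradient recovers the symmetric Gaussian neighbourhood of the constant density case.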


Recall that $P(y_1'|y_1,y_2)$ and $P(y_2'|y_1,y_2)$ serve as topographic neighbourhood functions for
optimising the $x_1 \to y_1 \rightsquigarrow y_1' \to x_1'$ and $x_2 \to y_2 \rightsquigarrow y_2' \to x_2'$
transformations. In the ensemble average model, these neighbourhood functions emerge
naturally from the ensemble properties of $(y_1,y_2) \to z \to (y_1',y_2')$, so we do not need to
supply them manually.

The automatic generation by one part of a network of the topographic neighbourhood
function required by another part of the same network is sufficiently novel and important
that we call it self-supervision. It is an effect that lies halfway between full supervised
training with an external teacher, and unsupervised training. It is an economical way of
extending the capabilities of an unsupervised network towards those of a supervised
network¹⁶.


¹⁶ Self-supervised networks are unable to produce the outputs required by an external
teacher (because there is no teacher). All they can do is to supervise their internal
operation, but not that of their output layer.
7.3. Estimating the code vector density

In order to determine the result for $P(y'|y)$ in Equation 33 we must estimate $\rho(y)$ and
$\nabla\rho(y)$. We may use the asymptotic relationship¹⁷

$$\rho(y) \propto P(y)^{N/(N+2)} \tag{36}$$

$$\frac{\nabla\rho(y)}{\rho(y)} = \frac{N}{N+2}\, \frac{\nabla P(y)}{P(y)} \tag{37}$$

to express our results either in terms of $P(y)$ or $\rho(y)$.

If we record $P(y)$ as a histogram of frequencies of occurrence of input vectors $y$, then we
may directly estimate $\nabla P(y)/P(y)$, and thence estimate $\nabla\rho(y)/\rho(y)$¹⁸. We may also make
a crude estimate of $\nabla\rho(y)/\rho(y)$ from a single realisation of the code vectors.

7.3.1.  Estimation  from  the  histograms 

We will now describe how to estimate $P(y)$ and $\nabla P(y)$ from a histogram. Denote the
transformation from continuous to discrete variables as

$$k_i(y_i) \equiv \text{int}\left[ \frac{B\, (y_i - y_{i,\min})}{y_{i,\max} - y_{i,\min} + \delta} \right] \tag{38}$$

where $y_{i,\min}$ and $y_{i,\max}$ are the minimum and maximum values that $y_i$ can possibly take, $B$
is the number of bins in each dimension of the multidimensional histogram, and $\delta$ is a
small positive number that we introduce to ensure that $0 \le k_i(y_i) < B$ (i.e. a strict inequality
at the upper end of the range). Thus the full vector index required to locate a bin in the
multidimensional histogram is

$$k(y) = (k_1(y_1), k_2(y_2), \ldots, k_N(y_N)) \tag{39}$$

which we then update using

$$r(k(y)) \to r(k(y)) + 1 \tag{40}$$

It is important that the histogram should also have a finite memory time in order that it
can track a time dependent $P(y)$. This is easily arranged by making the histogram bins
leaky. For instance, the number of counts in each bin could be a real number (not an
integer), all of which simultaneously decay (before $r(k(y))$ is updated) according to the
prescription

$$r(k) \to \beta\, r(k) \tag{41}$$

where $0 < \beta < 1$. The overall update process would then be described by

$$r(k; t+1) = \beta\, r(k; t) + \nu(k; t) \tag{42}$$


¹⁷ Strictly speaking, this is true only for minimum distortion encoding.
where $\nu(k;t)$ is a multivariate Poisson process which derives from the hits on the array of
histogram bins. The "memory time" of this type of process is $1/(1-\beta)$ time steps, so $\beta$ can
readily be used to control the ability of the histogram to track a time dependent $P(y)$.

A less computationally intensive prescription for histogram decay would be to use
Equation 42 to decay the histogram bins only occasionally. For instance, if we decay
$r(k) \to r(k)/e$ after every $1/(1-\beta)$ time steps, then we crudely emulate the effect of
Equation 42 applied at every single time step¹⁹. This leads to quite acceptable results,
and it is the procedure that we adopt in our numerical simulations.
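
The two decay prescriptions can be compared with a minimal Python sketch of a one-dimensional leaky histogram. The class and function names, the restriction to one dimension, the placement of $\delta$ in the bin index and the default value of `delta` are assumptions made for illustration; they are not taken from the simulation code used in this memorandum.

```python
import math


def bin_index(y, y_min, y_max, bins, delta=1e-9):
    """Map a continuous value to a bin index in 0..bins-1 (cf. Equation 38)."""
    return int(bins * (y - y_min) / (y_max - y_min + delta))


class LeakyHistogram:
    """One-dimensional histogram with memory time 1/(1 - beta) time steps."""

    def __init__(self, y_min, y_max, bins, beta, every_step=True):
        self.y_min, self.y_max, self.bins = y_min, y_max, bins
        self.beta = beta
        self.every_step = every_step
        self.period = max(1, round(1.0 / (1.0 - beta)))
        self.counts = [0.0] * bins
        self.steps = 0

    def add(self, y):
        if self.every_step:
            # Decay every bin before recording the hit (cf. Equations 41 and 42).
            self.counts = [self.beta * c for c in self.counts]
        self.counts[bin_index(y, self.y_min, self.y_max, self.bins)] += 1.0
        self.steps += 1
        if not self.every_step and self.steps % self.period == 0:
            # Cheaper emulation: divide every bin by e once per memory time.
            self.counts = [c / math.e for c in self.counts]

    def total(self):
        return sum(self.counts)


exact = LeakyHistogram(0.0, 1.0, bins=10, beta=0.99, every_step=True)
cheap = LeakyHistogram(0.0, 1.0, bins=10, beta=0.99, every_step=False)
for t in range(2000):
    y = (t % 10) / 10.0 + 0.05  # cycle deterministically through the bins
    exact.add(y)
    cheap.add(y)
```

In this sketch the per-step decay settles at a total count of $1/(1-\beta)$, whereas the occasional decay oscillates around a comparable value, which is the sense in which it is only a crude emulation.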

7.3.2. Estimation from the code vector positions


For completeness, we shall now describe how to make a crude estimate of $\rho(y)$ and
$\nabla\rho(y)$ from a single realisation of the code vectors, although we do not make use of this
prescription in our numerical simulations. We may obtain the required estimate by
measuring the zeroth and first moments ($M_0(y,R)$ and $M_1(y,R)$ respectively) of the code
vector positions within a sphere $\|\xi - y\| \le R$. The definition of these moments is

$$M_0(y,R) \equiv \int_{\|\xi - y\| \le R} d\xi\, \rho(\xi), \qquad M_1(y,R) \equiv \int_{\|\xi - y\| \le R} d\xi\, \rho(\xi)\, (\xi - y) \tag{43}$$

whereas the estimate of these moments is

$$M_0(y,R) \approx \sum_{\|\xi - y\| \le R} 1, \qquad M_1(y,R) \approx \sum_{\|\xi - y\| \le R} (\xi - y) \tag{44}$$

Combining Equation 43 and Equation 44, and inserting the expression in Equation 25
yields

$$\rho(y) \approx \frac{N}{\Omega_N R^N} \sum_{\|\xi - y\| \le R} 1, \qquad \nabla\rho(y) \approx \frac{N(N+2)}{\Omega_N R^{N+2}} \sum_{\|\xi - y\| \le R} (\xi - y) \tag{45}$$

whence $\nabla\rho(y)/\rho(y)$ is given by

$$\frac{\nabla\rho(y)}{\rho(y)} \approx \frac{N+2}{R^2}\, \frac{\sum_{\|\xi - y\| \le R} (\xi - y)}{\sum_{\|\xi - y\| \le R} 1} = \frac{(N+2)\, \langle \xi - y \rangle_{\|\xi - y\| \le R}}{R^2}$$

where $\langle \cdots \rangle$ denotes an average over code vectors.

The optimum choice of $R$ is a tradeoff. If $R$ is too small then there are too few code
vectors in the sphere $\|\xi - y\| \le R$ to allow a good estimate to be made in Equation 44. If $R$
is too big then we invalidate the assumption that we may ignore the higher order terms
(e.g. curvature) in the Taylor expansion in Equation 25. Between these two extremes will
lie an optimum choice of $R$, whose value can be determined by experiment.
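
The moment-based estimate can be sketched in a few lines of Python. The estimator averages the offsets of the code vectors inside the sphere and rescales by $(N+2)/R^2$; the helper that draws synthetic code vectors from a linearly varying density in the square $[-1,1]^2$ is purely an assumption used to exercise the estimator, as are all the function names.

```python
import random


def estimate_log_density_gradient(code_vectors, y, radius):
    """Estimate grad(rho)/rho at y as (N + 2) * <xi - y> / R^2, where the
    average runs over the code vectors xi with ||xi - y|| <= R."""
    n_dim = len(y)
    count = 0
    first_moment = [0.0] * n_dim
    for xi in code_vectors:
        offset = [a - b for a, b in zip(xi, y)]
        if sum(c * c for c in offset) <= radius * radius:
            count += 1
            for i, c in enumerate(offset):
                first_moment[i] += c
    if count == 0:
        raise ValueError("no code vectors inside the sphere; increase the radius")
    scale = (n_dim + 2) / (radius * radius * count)
    return [scale * m for m in first_moment]


def sample_linear_density(n_samples, slope, rng):
    """Draw points in [-1, 1]^2 with density proportional to 1 + slope * x_0,
    using rejection sampling (requires |slope| <= 1)."""
    points = []
    while len(points) < n_samples:
        p = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        if rng.random() * (1.0 + abs(slope)) < 1.0 + slope * p[0]:
            points.append(p)
    return points


rng = random.Random(0)
code_vectors = sample_linear_density(100000, slope=0.5, rng=rng)
gradient = estimate_log_density_gradient(code_vectors, y=(0.0, 0.0), radius=0.8)
```

For this synthetic density the true value of $\nabla\rho/\rho$ at the origin is $(0.5, 0)$, so the estimate should approach it as the number of code vectors inside the sphere grows.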


¹⁹ This can easily be checked as follows: [...]